Previously, we discussed the notion of similarity among documents.
Similarity relied on the assumption that each document can be represented as a vector of (weighted) feature counts.
Some of the ways to measure document similarity included (sketched in code below):
Edit distances
Inner product
Euclidean distance
Cosine similarity
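As a quick refresher, the three vector-based measures can be computed in a few lines of NumPy. Below is a minimal sketch on two made-up count vectors (edit distances operate on raw strings, so they are omitted here):
Python
import numpy as np

# Two made-up (weighted) feature-count vectors
a = np.array([2.0, 1.0, 0.0, 3.0])
b = np.array([1.0, 1.0, 1.0, 2.0])

# Inner product: grows with shared (heavily weighted) features
inner = a @ b

# Euclidean distance: sensitive to differences in document length
euclidean = np.linalg.norm(a - b)

# Cosine similarity: the inner product of the length-normalized vectors
cosine = inner / (np.linalg.norm(a) * np.linalg.norm(b))

print(inner, euclidean, cosine)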
Application
Using the same dataset as before, we want to see how similar politicians' speeches are to one another.
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Group the DataFrame by the 'name' column.
# For each group, concatenate all text entries in the 'body' column into a single string, separated by spaces.
# Reset the index so that 'name' becomes a standard column instead of an index.
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated.head(8)
name body
0 Adam Afriyie I welcomed much of what was said by the honou...
1 Adam Holloway What recent discussions he has had on the futu...
2 Adam Ingram I think I said that it was an additional power...
3 Adam Price There has been much talk in the Chamber this a...
4 Adrian Bailey Given the failure of successive well-intention...
5 Adrian Sanders I do not know whether the honourable Gentleman...
6 Afzal Khan What recent progress the Government have made ...
7 Aidan Burley I should start with a declaration of interest ...
Application
Creating a Count-Vectorized Representation of Speeches
We first create a count-vectorized representation of the speeches:
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])
We then select the politician of interest
Python
# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()
Application
Calculating the Cosine Similarity
We then build a friendlier dataframe that we can visualize:
Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
1590 William Hague 0.964526
283 David Miliband 0.964144
278 David Lidington 0.963183
1356 Philip Hammond 0.963094
Application
Comparing to other Speeches
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
Comparing it with Other Speeches
As you can see below, most speeches are similar to one another.
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df

# Create the histogram
ggplot(similarity_df2, aes(x = cosine_similarity)) +
  geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
  labs(title = "Histogram of Cosine Similarities",
       x = "Cosine Similarity",
       y = "Frequency") +
  theme_bw()
Application
Misleading Word Counts
Let us compare the most common features:
Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
35017 the 85
35375 to 43
24497 of 34
35010 that 33
2684 and 28
19341 is 25
18172 in 22
38445 will 13
Application
Misleading Word Counts
Feature selection matters! Similarities here are being driven by substantively unimportant words.
One solution would be to remove stopwords and try again.
Application
Removing Stopwords and Recomputing Similarity
We remove the stopwords and recompute the count-based cosine similarities:
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk

# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Find the index of the selected politician (Boris Johnson)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
256 David Cameron 0.569758
1018 Mr John Major 0.569098
278 David Lidington 0.566099
1564 Tony Blair 0.562866
Application
Removing Stopwords and Recomputing Similarity
We then select the politician of interest
Python
# Find the index of the selected politician (Boris Johnson)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],
    'cosine_similarity': cosine_sim_50
})
Application
Removing Stopwords and Recomputing Similarity
We then sort by cosine similarity and show the politicians most similar to Boris Johnson:
Python
# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
256 David Cameron 0.569758
1018 Mr John Major 0.569098
278 David Lidington 0.566099
1564 Tony Blair 0.562866
Application
After removing Misleading Word Counts
This is what the output looks like if we remove the stopwords.
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_new_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 0.7) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
This is how the original results compare with the stopword-free version:
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
And here are the leading words.
Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for Boris Johnson
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
top_features
term freq
17343 honourable 11
38748 would 7
28951 referendum 6
22764 minister 6
14121 fields 5
14904 free 5
26525 playing 5
17490 house 4
Application
After removing Misleading Word Counts
And this is how the top words compare:
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
35017 the 85
35375 to 43
24497 of 34
35010 that 33
2684 and 28
19341 is 25
18172 in 22
38445 will 13
Python
# Download the NLTK stopword list if not already done
# nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
17343 honourable 11
38748 would 7
28951 referendum 6
22764 minister 6
14121 fields 5
14904 free 5
26525 playing 5
17490 house 4
Caveats
When comparing Boris Johnson's speeches to those of other politicians, we characterised each speech according to the raw count of every word.
We used the raw term frequency to characterize similarity:
each word is considered equally important.
This way of counting words is called bag-of-words: a model that represents a text as an unordered collection of its words, ignoring grammar and word order.
Caveats: bag-of-words
The following example models text documents using bag-of-words. Here are two simple text documents:
(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.
Based on these two text documents, a bag-of-words list is constructed for each:
(1) {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
(2) {"Mary": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "football": 1, "games": 1}
Tf-idf then weights these raw counts by how distinctive each word is across the corpus:
\(W_{i,j}\) - the number of times feature \(j\) appears in document \(i\)
\(df_{j}\) - the number of documents in the corpus that contain feature \(j\)
\(N\) - the total number of documents
\[\textrm{tf-idf}_{i,j} = W_{i,j} \times \log\left(\frac{N}{df_j}\right)\]
We use \(\log(\frac{N}{df_j})\) rather than \(\frac{N}{df_j}\) in order to avoid placing excessively large weights on rare words.
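To make the weighting concrete, here is a hand computation of tf-idf for the two toy documents above. This is a minimal sketch using the \(W_{i,j} \times \log(N/df_j)\) formula from this lecture (scikit-learn's TfidfVectorizer uses a slightly different, smoothed variant):
Python
import math

# Bag-of-words counts for the two toy documents above
doc1 = {"john": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "mary": 1, "too": 1}
doc2 = {"mary": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "football": 1, "games": 1}
docs = [doc1, doc2]
N = len(docs)

# df_j: the number of documents that contain feature j
vocab = sorted(set(doc1) | set(doc2))
df = {j: sum(j in d for d in docs) for j in vocab}

# tf-idf for document 1: W_{1,j} * log(N / df_j)
tfidf_doc1 = {j: doc1.get(j, 0) * math.log(N / df[j]) for j in vocab}

# "movies" (frequent in doc 1, absent from doc 2) scores highest;
# words shared by both documents get weight 0, since log(2/2) = 0
print(sorted(tfidf_doc1.items(), key=lambda kv: -kv[1])[:3])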
Tf-idf intuition
Characteristics
Tf-idf will be highest when feature \(j\) occurs many times in a small number of documents
Tf-idf will be lower when feature \(j\) occurs only a few times in a document, or occurs in many documents
Tf-idf will be lowest when feature \(j\) occurs in virtually all documents (see the sketch below)
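These properties can be checked with scikit-learn on a small made-up corpus. Note that TfidfVectorizer uses a smoothed idf, \(\log(\frac{1+N}{1+df_j}) + 1\), so ubiquitous words get a low but not exactly zero weight:
Python
from sklearn.feature_extraction.text import TfidfVectorizer

# A made-up corpus: "the" appears in every document, "whale" many times in one document
corpus = [
    "the whale the whale the whale",
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Compare the weight of the ubiquitous "the" with the concentrated "whale" in document 0
vocab = vectorizer.vocabulary_
row = tfidf[0].toarray().flatten()
print("the:", round(row[vocab["the"]], 3), "whale:", round(row[vocab["whale"]], 3))
# "whale" (frequent in one document, absent elsewhere) gets a noticeably higher weight than "the"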
Tf-idf Application
What are the most common words in Boris Johnson’s speeches?
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords (if not already done)
# nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords function
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Load the dataset
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Aggregate the texts by "name"
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
count_vectorizer = CountVectorizer()
text_dfm = count_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Create a TfidfVectorizer for TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_dfm = tfidf_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Identify the index for 'Boris Johnson' (replace 'Boris Johnson' with the actual name you want)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Get the TF-IDF scores for 'Boris Johnson'
tfidf_scores = tfidf_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their TF-IDF scores for easy sorting
tfidf_df = pd.DataFrame({
    'term': feature_names,
    'tfidf': tfidf_scores
})

# Get the top 8 features by TF-IDF score
top_tfidf_features = tfidf_df.sort_values(by='tfidf', ascending=False).head(8)
print(top_tfidf_features)
term tfidf
14121 fields 0.262896
28951 referendum 0.196836
26525 playing 0.185771
19541 jcpoa 0.160847
19387 israel 0.140571
9169 criminalised 0.137572
17343 honourable 0.135140
23696 nato 0.133815
Tf-idf Application
We now look at the similarities to other politicians:
Python
# Selecting the politician of interest (replace 'Boris Johnson' with the actual name)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_tfidf = cosine_similarity(tfidf_dfm[boris_index], tfidf_dfm).flatten()

# Add cosine similarity to the DataFrame
aggression_texts_aggregated['cosine_similarity_to_boris'] = cosine_sim_tfidf

# Sort the DataFrame by cosine similarity (highest similarity first)
similarity_results_tfidf = aggression_texts_aggregated.sort_values(by='cosine_similarity_to_boris', ascending=False)

# Rename the columns
similarity_results_tfidf = similarity_results_tfidf.rename(columns={
    'name': 'name_politician',
    'cosine_similarity_to_boris': 'cosine_similarity'
})
R
library(dplyr)
library(ggplot2)

similarity_results_tfidf2 <- reticulate::py$similarity_results_tfidf
similarity_results_tfidf2 <- head(similarity_results_tfidf2, 11)
similarity_results_tfidf2 <- subset(similarity_results_tfidf2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_results_tfidf2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
This is how the original count-based, stopword-free, and tf-idf results compare:
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df
similarity_df2 <- head(similarity_df2, 11)
similarity_df2 <- subset(similarity_df2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords (if not already done)
# nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords function
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Load the dataset
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Aggregate the texts by "name"
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
count_vectorizer = CountVectorizer()
text_dfm = count_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Create a TfidfVectorizer for TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_dfm = tfidf_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Identify the index for 'Boris Johnson'
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_tfidf = cosine_similarity(tfidf_dfm[boris_index], tfidf_dfm).flatten()

# Add cosine similarity to the DataFrame
aggression_texts_aggregated['cosine_similarity_to_boris'] = cosine_sim_tfidf

# Sort the DataFrame by cosine similarity (highest similarity first)
similarity_results_tfidf = aggression_texts_aggregated.sort_values(by='cosine_similarity_to_boris', ascending=False)

# Rename the columns
similarity_results_tfidf = similarity_results_tfidf.rename(columns={
    'name': 'name_politician',
    'cosine_similarity_to_boris': 'cosine_similarity'
})
R
library(dplyr)
library(ggplot2)

similarity_results_tfidf2 <- reticulate::py$similarity_results_tfidf
similarity_results_tfidf2 <- head(similarity_results_tfidf2, 11)
similarity_results_tfidf2 <- subset(similarity_results_tfidf2, name_politician != "Boris Johnson")

# Create the bar chart
ggplot(similarity_results_tfidf2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Caveats Associated with Tf-idf Cosine Similarity
There are, however, some caveats associated with cosine similarity.
Note
Text 1: “Artificial intelligence has revolutionized text processing”
Text 2: “Progress in computational linguistics is dramatic”
The message of these two sentences is pretty much the same.
index    artificial  intelligence  revolutionized  text  processing  progress  in   computational  linguistics  is   dramatic
text_1   1.0         1.0           1.0             1.0   1.0         0.0       0.0  0.0            0.0          0.0  0.0
text_2   0.0         0.0           0.0             0.0   0.0         1.0       1.0  1.0            1.0          1.0  1.0
Yet the cosine similarity is 0: the two texts have non-overlapping sets of words.
\[cos(\theta) = \frac{a \cdot b}{||a|| ||b||}=0\]
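A quick sketch confirms this for the two sentences above:
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Artificial intelligence has revolutionized text processing",
    "Progress in computational linguistics is dramatic",
]

# Vectorize both sentences over a shared vocabulary
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts)

# No shared words, so the dot product (and hence the cosine) is exactly 0
print(cosine_similarity(dtm[0], dtm[1]))  # [[0.]]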
Word-embedding approaches are the solution to this. More on this in future lectures.
Word Clouds
A common way to visualize differences is by using word clouds.
Word clouds are visual representations of the frequency and importance of words in a given text:
the size of each word indicates its frequency or importance within the text.
Word Clouds: Johnson vs. Cameron
Python
import pandas as pd
import re
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud

# Load the dataset
file_path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv"
aggression_texts = pd.read_csv(file_path)

# Subset each politician's speeches (.copy() avoids pandas' SettingWithCopyWarning below)
text_johnson = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson'].copy()
text_cameron = aggression_texts.loc[aggression_texts['name'] == 'David Cameron'].copy()

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text by removing punctuation and stopwords
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', str(text))
    # Remove stopwords
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Apply text cleaning to the datasets
text_johnson['cleaned_text'] = text_johnson['body'].apply(clean_text)
text_cameron['cleaned_text'] = text_cameron['body'].apply(clean_text)
R
library(quanteda)
library(gridExtra)
library(png)
library(grid)

# Save the word clouds to temporary image files
png("johnson_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_johnson2, max_words = 300)
garbage <- dev.off()

png("cameron_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_cameron2, max_words = 300)
garbage <- dev.off()

# Read the images back as grobs
johnson_grob <- rasterGrob(readPNG("johnson_wordcloud.png"))
cameron_grob <- rasterGrob(readPNG("cameron_wordcloud.png"))

# Arrange the grobs side by side
grid.arrange(johnson_grob, cameron_grob, ncol = 2)
Even here, it is difficult to identify distinguishing words.
The primary difficulty is that the X and Y axes of a word cloud are meaningless: word placement carries no information.
Fightin’ Words
One approach is to visualise the difference in word use across groups by using the Fightin’ Words method (Monroe et al. 2008)
This starts with calculating the probability of observing a given word in a given category of documents:
\(W^{*}_{j,k}\) - the number of times feature \(j\) appears in documents in category \(k\)
\(n_k\) - the total number of tokens in documents in category \(k\)
\(a_j\) - a “regularization” parameter which shrinks differences in very common words towards 0
\[\hat{p}_{j,k} = \frac{W^{*}_{j,k} + a_j}{n_k + \sum_{j'} a_{j'}}\]
Fightin’ Words
We then take the log-odds ratio between categories \(k\) and \(k'\):
\[\textrm{log-odds-ratio}_{j,k} = \log\left(\frac{\hat{p}_{j,k}}{1-\hat{p}_{j,k}}\right) - \log\left(\frac{\hat{p}_{j,k'}}{1-\hat{p}_{j,k'}}\right)\]
This ratio estimates the relative probability of the use of word \(j\) between the two groups.
When the ratio is positive, group \(k\) uses the word more often; when it is negative, group \(k'\) uses it more often.
Fightin’ Words
The final step is to standardize the ratio by its variance.
\[
\textrm{Fightin' Words Score}_j =
\frac{\textrm{log-odds-ratio}_{j,k}}
{\sqrt{\mathrm{Var}(\textrm{log-odds-ratio}_{j,k})}}
\]
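Before turning to the corpus, here is a minimal numeric sketch of the score for a single word. The function and counts are made up for illustration; it uses the variance approximation \(\mathrm{Var} \approx \frac{1}{W^{*}_{j,k}+a_j} + \frac{1}{W^{*}_{j,k'}+a_j}\) from Monroe et al. (2008), and for simplicity it collapses \(\sum_{j'} a_{j'}\) to a single value:
Python
import numpy as np

def fightin_words_score(w_k, n_k, w_kprime, n_kprime, a=0.01):
    """Standardized log-odds ratio for a single word (sketch after Monroe et al. 2008).

    w_k, w_kprime: counts of the word in categories k and k'
    n_k, n_kprime: total token counts in categories k and k'
    a: regularization parameter (also used here in place of the full sum over the vocabulary)
    """
    p_k = (w_k + a) / (n_k + a)
    p_kprime = (w_kprime + a) / (n_kprime + a)
    # Log-odds ratio between the two categories
    lor = np.log(p_k / (1 - p_k)) - np.log(p_kprime / (1 - p_kprime))
    # Variance approximation from Monroe et al. (2008)
    var = 1.0 / (w_k + a) + 1.0 / (w_kprime + a)
    return lor / np.sqrt(var)

# Made-up counts: a word used 6 times in 2,000 tokens by group k,
# and once in 3,000 tokens by group k'
print(round(fightin_words_score(6, 2000, 1, 3000), 2))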
Fightin’ Words
We begin implementing this technique by preprocessing the speeches and computing the word counts for each group:
Python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import nltk

# Download NLTK stopwords if not already available
nltk.download('stopwords')

# Text preprocessing
def preprocess_text(text):
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'body' column
aggression_texts['body'] = aggression_texts['body'].apply(preprocess_text)

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Subset text data for Johnson and Cameron
text_group1 = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson']
text_group2 = aggression_texts.loc[aggression_texts['name'] == 'David Cameron']

# Fit the vectorizer on both groups so the vocabulary covers every word used by either politician
vectorizer.fit(pd.concat([text_group1['body'], text_group2['body']]))

# Create document-term matrices (DFMs) for each group
dfm_input1 = vectorizer.transform(text_group1['body'])
dfm_input2 = vectorizer.transform(text_group2['body'])

# Get feature names
features = vectorizer.get_feature_names_out()

# Sum term frequencies for each group
counts_group1 = np.array(dfm_input1.sum(axis=0)).flatten()
counts_group2 = np.array(dfm_input2.sum(axis=0)).flatten()
Fightin’ Words
Here is how we implement this technique in a function.